DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Author information
A paper by the DeepSeek team, published in CoRR 2024.
Abstract:
In the era of large language models, Mixture-of-Experts (MoE) is a promising architecture for managing computational costs when scaling up model parameters【MoE reduces computational cost when scaling up】. However, conventional MoE architectures like GShard, which activate the top-K out of N experts, face challenges in ensuring expert specialization, i.e., each expert acquires non-overlapping and focused knowledge【but the experts are not specialized enough】. In response, we propose the DeepSeekMoE architecture towards ultimate expert specialization. It involves two principal strategies: (1) finely segmenting the experts into mN ones and activating mK from them, allowing for a more flexible combination of activated experts【allows more flexible combinations of activated experts】; (2) isolating Ks experts as shared ones, aiming at capturing common knowledge and mitigating redundancy in routed experts【introduces the concept of shared experts】. Starting from a modest scale with 2B parameters, we demonstrate that DeepSeekMoE 2B achieves comparable performance with GShard 2.9B, which has 1.5 times the expert parameters and computation. In addition, DeepSeekMoE 2B nearly approaches the performance of its dense counterpart with the same number of total parameters, which sets the upper bound of MoE models. Subsequently, we scale up DeepSeekMoE to 16B parameters and show that it achieves comparable performance with LLaMA2 7B, with only about 40% of computations. Further, our preliminary efforts to scale up DeepSeekMoE to 145B parameters consistently validate its substantial advantages over the GShard architecture, and show its performance comparable with DeepSeek 67B, using only 28.5% (maybe even 18.2%) of computations.
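A quick sanity check on strategy (1): splitting each of N experts into m finer ones and activating m times as many experts greatly enlarges the space of possible expert combinations. The numbers below (16 experts with top-2 routing versus 64 fine-grained experts with top-8 routing, i.e. m = 4) are illustrative choices for this sketch; the counting itself is plain combinatorics:

```python
from math import comb

# Conventional MoE: N = 16 experts, activate K = 2 (illustrative numbers)
conventional = comb(16, 2)   # 120 possible expert combinations

# Fine-grained MoE: each expert split into m = 4 pieces,
# so mN = 64 experts with mK = 8 activated per token
fine_grained = comb(64, 8)   # 4,426,165,368 possible combinations

print(conventional, fine_grained)
```

Even though the activated parameter count stays the same, the router can now express billions of distinct expert combinations instead of 120, which is what "more flexible combination of activated experts" refers to.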
总结概括
Introduction
Current MoE models suffer from knowledge hybridity and knowledge redundancy, both of which limit experts from acquiring non-overlapping knowledge.
Knowledge hybridity: conventional MoE uses only 8 to 16 experts, so each expert tends to cover broad, generic knowledge rather than specializing.
Knowledge redundancy: the knowledge captured by different experts may overlap.
Further reading: "A 10,000-word analysis of the DeepSeek MoE architecture: from Switch Transformers to DeepSeek v1/v2/v3" (Zhihu)